Your homework should be completed in R Markdown or Quarto and knitted to an HTML or PDF document. You will "turn in" this homework by uploading it to your GitHub Math_3190_Assignment repository in the Homework directory.
Some of the parts in this homework, like 1a, 2a, 2b, and 2c require writing down some math-heavy expressions. You may either type it up using LaTeX style formatting in R Markdown, or you can write it by hand (neatly) and include pictures or scans of your work in your R Markdown document.
Suppose we want to find the value of \(x\) and the objective function value at the global maximum and the global minimum of the function \(f(x)=\ln(1+x^2)(1+x)e^{-x^2}\). We will implement gradient descent and Newton's method to find these extrema.
Find \(f'(x)\) for this function. Do this by hand as a nice derivative refresher. Feel free to check your answer using something like Wolfram Alpha.
Response: Wherever the triple product \(fgh\) is defined and differentiable, the product rule gives \(\frac{d}{dx}(fgh) = f'gh+fg'h+fgh'\), and so:
\[ f'(x)=\dfrac{2x(x+1)e^{-x^2}}{x^2+1}+e^{-x^2}\ln(x^2+1)(-2x^2-2x+1) \]
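As a quick sanity check (not required by the assignment), the hand-derived derivative can be compared against a central finite difference; the names `f`, `f_prime`, and `num_deriv` below are local to this snippet:

```r
# f and its hand-derived first derivative
f <- function(x) log(1 + x^2) * (1 + x) * exp(-x^2)
f_prime <- function(x) {
  2 * x * (x + 1) * exp(-x^2) / (x^2 + 1) +
    exp(-x^2) * log(x^2 + 1) * (-2 * x^2 - 2 * x + 1)
}

# central finite-difference approximation of f'
num_deriv <- function(g, x, h = 1e-6) (g(x + h) - g(x - h)) / (2 * h)

x_test <- c(-1.5, -0.5, 0.5, 1, 2)
# the discrepancy should be tiny (finite-difference error only)
max(abs(f_prime(x_test) - num_deriv(f, x_test)))
```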
Write a function that implements gradient descent. Use a backtracking line search each iteration to find \(\gamma_k\): the step size. This function should take eight inputs: a starting point x_k, the function f, its gradient, a findmax flag (default FALSE), a maximum number of iterations max_iter, an initial step size gamma_k, a backtracking factor beta, and a convergence tolerance (default 1e-10). The output of this function should be a list with the \(x\) value (or vector) that optimizes the function, the function value at that point, and the number of iterations it took to converge. This function should work in the case of one dimension or multiple dimensions since you'll use it again in problem 2.
gradient_descent <- function(x_k, f, gradient, findmax = FALSE, max_iter = 200,
                             gamma_k = 1, beta = 0.5, tolerance = 1e-10) {
  # To find a maximum, minimize the negated function
  if (findmax) {
    obj_func <- function(x) -f(x)
    obj_gradient <- function(x) -gradient(x)
  } else {
    obj_func <- f
    obj_gradient <- gradient
  }
  for (k in 1:max_iter) {
    gamma <- gamma_k
    grad_k <- obj_gradient(x_k)
    # Backtracking line search: shrink the step size until a
    # sufficient-decrease (Armijo-type) condition is satisfied
    while (obj_func(x_k - gamma * grad_k) >
             obj_func(x_k) - gamma / 4 * sum(grad_k^2)) {
      gamma <- beta * gamma
    }
    # Update x_k
    x_new <- x_k - gamma * grad_k
    # Check convergence
    if (norm(x_k - x_new, type = "2") < tolerance) {
      break
    }
    x_k <- x_new
  }
  f_eval <- f(x_k)
  if (findmax) {
    print(paste("The maximum is:", f_eval))
    print("Max at x_k / betas :")
  } else {
    print(paste("The minimum is:", f_eval))
    print("Min at x_k / beta:")
  }
  print(x_k)
  print(paste("with", k, "iterations"))
  invisible(list(x = x_k, value = f_eval, iterations = k))
}
Use your gradient descent function to find the global min of \(f(x)=\ln(1+x^2)(1+x)e^{-x^2}\). Try using several starting points. Keep track of and report the number of iterations needed to converge at the different starting points.
fun <- function(x) {
  log(1 + x^2) * (1 + x) * exp(-x^2)
}
fun_prime <- function(x) {
  (2 * x * (x + 1) * exp(-x^2)) / (x^2 + 1) +
    exp(-x^2) * log(x^2 + 1) * (-2 * x^2 - 2 * x + 1)
}
gradient_descent(x_k= 1, f=fun, gradient=fun_prime, findmax = F, gamma=1, beta=0.5)
## [1] "The minimum is: 0.000148535426099015"
## [1] "Min at x_k / beta:"
## [1] 3.347721
## [1] "with 200 iterations"
gradient_descent(x_k= 0.5, f=fun, gradient=fun_prime, findmax = F, gamma=1, beta=0.5)
## [1] "The minimum is: 0"
## [1] "Min at x_k / beta:"
## [1] -4.373893e-09
## [1] "with 7 iterations"
gradient_descent(x_k= -0.5, f=fun, gradient=fun_prime, findmax = F, gamma=1, beta=0.5)
## [1] "The minimum is: 0"
## [1] "Min at x_k / beta:"
## [1] -3.155042e-12
## [1] "with 8 iterations"
gradient_descent(x_k= -1, f=fun, gradient=fun_prime, findmax = F, gamma=1, beta=0.5)
## [1] "The minimum is: -0.0623236248471729"
## [1] "Min at x_k / beta:"
## [1] -1.469439
## [1] "with 31 iterations"
Response: The global minimum of \(-0.0623\) (at \(x\approx-1.469\)) is reached in 31 iterations from the starting point \(x_k=-1\). From \(x_k=-0.5\) and \(x_k=0.5\), the algorithm instead converges in seven to eight iterations to the local minimum at \(x=0\), and from \(x_k=1\) it fails to converge, using all 200 iterations while drifting out along the flat right tail of the function.
Use your gradient descent function to find the global max of \(f(x)=\ln(1+x^2)(1+x)e^{-x^2}\). Again, try several starting points. Keep track of and report the number of iterations needed to converge at the different starting points.
x_k_minus1_max <- gradient_descent(x_k=-1 , f=fun, gradient=fun_prime, findmax = T, gamma=0.9, beta=0.5)
## [1] "The maximum is: 0.0879346149477647"
## [1] "Max at x_k / betas :"
## [1] -0.5409866
## [1] "with 13 iterations"
x_k_minus.5_max <- gradient_descent(x_k=-0.5 , f=fun, gradient=fun_prime, findmax = T, gamma=0.9, beta=0.5)
## [1] "The maximum is: 0.0879346149477647"
## [1] "Max at x_k / betas :"
## [1] -0.5409866
## [1] "with 11 iterations"
x_k_.5_max <- gradient_descent(x_k=0.5 , f=fun, gradient=fun_prime, findmax = T, gamma=0.9, beta=0.5)
## [1] "The maximum is: 0.510181611312493"
## [1] "Max at x_k / betas :"
## [1] 0.9868668
## [1] "with 14 iterations"
x_k_1_max <- gradient_descent(x_k=1 , f=fun, gradient=fun_prime, findmax = T, gamma=0.9, beta=0.5)
## [1] "The maximum is: 0.510181611312493"
## [1] "Max at x_k / betas :"
## [1] 0.9868668
## [1] "with 10 iterations"
Response: The global maximum of \(0.5102\) (at \(x\approx0.987\)) is reached in 10 iterations from \(x_k=1\) and in 14 iterations from \(x_k=0.5\). From the negative starting points \(x_k=-1\) and \(x_k=-0.5\), the algorithm gets stuck at the local maximum of \(0.0879\) at \(x\approx-0.541\).
Use the fact that \[ f''(x)=\dfrac{2e^{-x^2}}{(1+x^2)^2}\left((2x^3+2x^2-3x-1)(1+x^2)^2\ln(1+x^2)-4x^5-4x^4-3x^3-5x^2+3x+1\right) \] to implement Newton’s method to find the global max and the global min. Use the same starting points you did in parts c and d. Keep track of and report the number of iterations needed to converge at the different starting points. And yes, that second derivative is very messy! This is one reason why Newton’s method is less popular. You don’t have to write a function for this part to implement Newton’s method, but you can if you’d like.
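For reference, the iteration implemented below is the standard one-dimensional Newton step applied to \(f'\) (so it converges to critical points of \(f\)):

```latex
x_{k+1} = x_k - \frac{f'(x_k)}{f''(x_k)}
```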
fun_second <- function(x){
(2*exp(-x^2)/(1+x^2)^2) * ((2*x^3 +2*x^2 -3*x -1) * (1+x^2)^2
* log(1+x^2) -4*x^5 -4*x^4 -3*x^3 -5*x^2 +3*x +1)
}
newtons_method <- function(x_k, f, f_prime, f_second, findmax = FALSE,
                           max_iter = 200, tolerance = 1e-10) {
  # Note: negating both f' and f'' leaves the Newton step f'(x)/f''(x)
  # unchanged, so the iteration seeks critical points either way;
  # findmax only controls how the result is labeled.
  if (findmax) {
    f_prime_2 <- function(x) -f_prime(x)
    f_second_2 <- function(x) -f_second(x)
  } else {
    f_prime_2 <- f_prime
    f_second_2 <- f_second
  }
  # Newton iteration
  for (k in 1:max_iter) {
    x_new <- x_k - f_prime_2(x_k) / f_second_2(x_k)
    if (abs(x_k - x_new) < tolerance) {
      break
    }
    x_k <- x_new
  }
  f_eval <- f(x_k)
  results <- c(x_k, f_eval, k)
  if (findmax) {
    print("The maximum is:")
  } else {
    print("The minimum is:")
  }
  print(f_eval)
  print(results)
  invisible(results)
}
###
newtons_method(x_k=-1,f=fun, f_prime=fun_prime, f_second=fun_second,
findmax = F, max_iter = 200, tolerance = 1e-10)
## [1] "The minimum is:"
## [1] -2.902035e-94
## [1] -1.485585e+01 -2.902035e-94 2.000000e+02
newtons_method(x_k=-1.03,f=fun, f_prime=fun_prime, f_second=fun_second,
findmax = F, max_iter = 200, tolerance = 1e-10)
## [1] "The minimum is:"
## [1] -0.06232362
## [1] -1.46943867 -0.06232362 7.00000000
newtons_method(x_k=1, f=fun, f_prime=fun_prime, f_second=fun_second,
findmax = T, max_iter = 200, tolerance = 1e-10)
## [1] "The maximum is:"
## [1] 0.5101816
## [1] 0.9868668 0.5101816 4.0000000
Compare the number of iterations needed for convergence for gradient descent and for Newton's method.

Response: Gradient descent with a backtracking line search converged to the global minimum in 31 iterations (from \(x_k=-1\)) and to the global maximum in 10 iterations (from \(x_k=1\)). Newton's method failed to converge to the minimum from \(x_k=-1\), diverging for all 200 iterations, but from the nearby starting point \(x_k=-1.03\) it converged in 7 iterations, and it reached the maximum from \(x_k=1\) in only 4 iterations. When it converges, Newton's method needs far fewer iterations, but it is much more sensitive to the starting point.
Use the optimize() function in R to find the global max and min.
opt <- optimize(fun,c(-100,100))
print(paste('Minimum at x=', opt$minimum))
## [1] "Minimum at x= -1.46943401434658"
print(paste('y=',opt$objective))
## [1] "y= -0.0623236248421882"
opt <- optimize(fun,c(-100,100), maximum=T)
print(paste('Maximum at x=', opt$maximum))
## [1] "Maximum at x= 0.986868461516571"
print(paste('y=',opt$objective))
## [1] "y= 0.510181611309466"
We implemented optimization to maximize the likelihood for a Poisson regression problem in the notes to estimate the \(\beta_i\) values for \(i=0,\dots,p-1\). Let’s do something similar to find the \(\boldsymbol{\beta}\) vector by maximizing the likelihood in logistic regression.
In logistic regression, we attempt to predict the probability of a success for a given case. We can use Bernoulli random variable for this. For a Bernoulli random variable, \(P(Y=y_i) = p_i^{y_i}(1-p_i)^{1-y_i}\) for \(y_i=0,1\) where \(p_i\) is the probability of a success. In our case, for logistic regression, we have \(\text{logit}(p_i)=\mathbf{X}_i\boldsymbol{\beta}\). That means \(p_i = \dfrac{\exp(\mathbf{X}_i\boldsymbol{\beta})}{1+\exp(\mathbf{X}_i\boldsymbol{\beta})}\).
Find the likelihood function, \(L(\boldsymbol{\beta}|\mathbf{X},\boldsymbol{y})\), for the logistic regression for a sample of size \(n\).
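A sketch of one possible answer: assuming the \(n\) observations are independent Bernoulli trials, the likelihood is the product of the individual probabilities,

```latex
L(\boldsymbol{\beta}\mid\mathbf{X},\boldsymbol{y})
  = \prod_{i=1}^n p_i^{y_i}(1-p_i)^{1-y_i},
  \qquad
  p_i = \frac{\exp(\mathbf{X}_i\boldsymbol{\beta})}{1+\exp(\mathbf{X}_i\boldsymbol{\beta})},
  \quad
  1-p_i = \frac{1}{1+\exp(\mathbf{X}_i\boldsymbol{\beta})}.
```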
Show that the log likelihood function is \[ \ell(\boldsymbol{\beta}|\mathbf{X},\boldsymbol{y})=\sum_{i=1}^ny_i\ln\left(\dfrac{\exp(\mathbf{X}_i\boldsymbol{\beta})}{1+\exp(\mathbf{X}_i\boldsymbol{\beta})}\right) + \sum_{i=1}^n(1-y_i)\ln\left(\dfrac{1}{1+\exp(\mathbf{X}_i\boldsymbol{\beta})}\right) \] and then show it can be equivalently written
\[ \ell(\boldsymbol{\beta}|\mathbf{X},\boldsymbol{y}) = \sum_{i=1}^ny_i(\mathbf{X}_i\boldsymbol{\beta}) - \sum_{i=1}^n\ln\Big(1 + \exp(\mathbf{X}_i\boldsymbol{\beta})\Big). \]
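One way the simplification can go (writing \(\eta_i=\mathbf{X}_i\boldsymbol{\beta}\) as shorthand introduced here, not notation from the assignment):

```latex
\ell(\boldsymbol{\beta}\mid\mathbf{X},\boldsymbol{y})
  = \sum_{i=1}^n y_i\ln\!\left(\frac{e^{\eta_i}}{1+e^{\eta_i}}\right)
  + \sum_{i=1}^n (1-y_i)\ln\!\left(\frac{1}{1+e^{\eta_i}}\right)
  = \sum_{i=1}^n y_i\,\eta_i
  - \sum_{i=1}^n y_i\ln\!\left(1+e^{\eta_i}\right)
  - \sum_{i=1}^n (1-y_i)\ln\!\left(1+e^{\eta_i}\right)
  = \sum_{i=1}^n y_i\,\eta_i - \sum_{i=1}^n \ln\!\left(1+e^{\eta_i}\right).
```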
Find the gradient of the log likelihood if we have three \(\beta\)’s. That is, if \(\mathbf{X}_i\boldsymbol{\beta}=\beta_0+\beta_1x_{i1}+\beta_2x_{i2}\).
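A sketch of the gradient for the three-coefficient case, with \(p_i\) as defined above; each component follows from \(\frac{\partial}{\partial\beta_j}\ln\big(1+\exp(\mathbf{X}_i\boldsymbol{\beta})\big) = x_{ij}\,p_i\):

```latex
\nabla \ell(\boldsymbol{\beta}) =
\begin{pmatrix}
\dfrac{\partial \ell}{\partial \beta_0}\\[6pt]
\dfrac{\partial \ell}{\partial \beta_1}\\[6pt]
\dfrac{\partial \ell}{\partial \beta_2}
\end{pmatrix}
=
\begin{pmatrix}
\sum_{i=1}^n (y_i - p_i)\\[6pt]
\sum_{i=1}^n x_{i1}(y_i - p_i)\\[6pt]
\sum_{i=1}^n x_{i2}(y_i - p_i)
\end{pmatrix}.
```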
In Homework 3, Problem 2, part a, you read in the adult dataset (from the UC Irvine database). This is the one that we used to predict whether a person makes over $50K a year based on some other variables. The data came from the Census Bureau in 1994 and can be found in the Data folder in my Math3190_S24 GitHub repo. More info on the dataset can be found in the "adult.names" file.
library(httr)      # for GET() and content()
library(tidyverse) # for read_csv(), rename(), mutate(), as_factor(), etc.
response <- GET("https://raw.githubusercontent.com/rbrown53/Math3190_Sp24/main/Data/adult.data")
adult_data <- readr::read_csv(content(response, as = "text"), col_names = FALSE) |>
rename(age = X1,
work_class = X2,
final_wght = X3,
education = X4,
edu_num = X5,
marital = X6,
occupation = X7,
relationship = X8,
race = X9,
sex = X10,
capital_gain = X11,
capital_loss = X12,
hrs_per_week = X13,
country_origin = X14,
salary = X15) |>
mutate(salary = as_factor(salary),
work_class = as_factor(work_class),
education = as_factor(education),
marital = as_factor(marital),
occupation = as_factor(occupation),
relationship = as_factor(relationship),
race = as_factor(race),
sex = factor(sex, levels = c("Female", "Male")),
country_origin = as_factor(country_origin))
## Rows: 48842 Columns: 15
## ── Column specification ───────────────────────────────
## Delimiter: ","
## chr (9): X2, X4, X6, X7, X8, X9, X10, X14, X15
## dbl (6): X1, X3, X5, X11, X12, X13
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(adult_data)
Read in the dataset and put column names like you did in HW 3. Then fit a logistic regression model using the glm() function that predicts salary from age/10 and sex. Note: we should divide the age by 10 in our model because this makes the gradient descent more stable (for whatever reason). We can divide by 10 in the glm() function by wrapping it in the I() function. Like this: glm(salary ~ I(age/10) + sex, data = adult, family = binomial).
adult_logit_mod <- glm(salary ~ I(age/10) + sex,
data = adult_data, family = binomial)
summary(adult_logit_mod)
##
## Call:
## glm(formula = salary ~ I(age/10) + sex, family = binomial, data = adult_data)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.629827 0.043440 -83.56 <2e-16 ***
## I(age/10) 0.382809 0.008144 47.01 <2e-16 ***
## sexMale 1.242016 0.028505 43.57 <2e-16 ***
## ---
## Signif. codes:
## 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 53751 on 48841 degrees of freedom
## Residual deviance: 48966 on 48839 degrees of freedom
## AIC: 48972
##
## Number of Fisher Scoring iterations: 4
Define y, x1, and x2. y should be a vector of 0's and 1's with a 1 indicating that the person had a salary above $50,000. x1 should be a vector that contains all of the age values (divided by 10), and x2 should be an indicator variable that indicates whether the person is male. Taking columns from the matrix obtained with model.matrix() may be useful here.
design_matrix <- model.matrix(adult_logit_mod)
y <- ifelse(adult_data$salary == ">50K", 1, 0)
x_1 <- design_matrix[,2]
x_2 <- design_matrix[,3]
Use your gradient descent function from problem 1 to find estimates for \(\beta_0\), \(\beta_1\), and \(\beta_2\) for predicting salary from age and sex. Enter c(0, 0, 0) as your starting point. You'll know you did this right if the optimized values you end up with match (or are very close to) the estimates from the logistic regression model you fit in part d. This may take a minute or two to run.
start <- c(0, 0, 0)
# Log likelihood in the simplified form sum(y * eta) - sum(log(1 + exp(eta))),
# where eta = b[1] + b[2]*x_1 + b[3]*x_2
ln_likely <- function(b) {
  eta <- b[1] + b[2] * x_1 + b[3] * x_2
  sum(y * eta) - sum(log(1 + exp(eta)))
}
# Equivalent form written directly in terms of the success probabilities
ln_likely_2 <- function(b) {
  eta <- b[1] + b[2] * x_1 + b[3] * x_2
  sum(y * log(exp(eta) / (1 + exp(eta)))) +
    sum((1 - y) * log(1 / (1 + exp(eta))))
}
# Gradient of the log likelihood: component j is sum(x_j * (y - p))
nabla_likely <- function(b) {
  eta <- b[1] + b[2] * x_1 + b[3] * x_2
  p <- exp(eta) / (1 + exp(eta))
  c(sum(y - p), sum(x_1 * (y - p)), sum(x_2 * (y - p)))
}
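The per-coefficient gradient above generalizes to any number of predictors: with a design matrix X, the gradient of the log likelihood is t(X) %*% (y - p). A self-contained sketch on simulated data (all names below are local to this snippet, not from the assignment); at the maximum likelihood estimate found by glm(), the gradient should be essentially zero:

```r
set.seed(1)
n <- 200
X <- cbind(1, rnorm(n), rbinom(n, 1, 0.5))  # intercept, one numeric, one indicator
beta_true <- c(-1, 0.5, 1)
p_true <- 1 / (1 + exp(-as.vector(X %*% beta_true)))
y_sim <- rbinom(n, 1, p_true)

# Vectorized log likelihood and gradient for any number of predictors
ln_likely_mat <- function(b, X, y) {
  eta <- as.vector(X %*% b)
  sum(y * eta) - sum(log(1 + exp(eta)))
}
nabla_likely_mat <- function(b, X, y) {
  p <- 1 / (1 + exp(-as.vector(X %*% b)))
  as.vector(t(X) %*% (y - p))
}

# The gradient evaluated at the glm() estimates is essentially zero
fit <- glm(y_sim ~ X[, 2] + X[, 3], family = binomial)
max(abs(nabla_likely_mat(coef(fit), X, y_sim)))
```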
gradient_descent( x_k= start, f= ln_likely, gradient= nabla_likely, findmax = T,
gamma=0.1 , beta=0.5 , max_iter = 300)
## [1] "The maximum is: -24483.07511215"
## [1] "Max at x_k / betas :"
## [1] -3.6298203 0.3828076 1.2420146
## [1] "with 174 iterations"
Response: The likelihood maximization converges after 174 iterations to parameter estimates that match those of the glm(family = binomial) model to about four decimal places.
Use the optim() function in R to find the estimates for the \(\beta\)'s here. This should be much faster since this function is much better optimized (no pun intended). The results here should also be very close to the coefficient estimates from the model we fit in part d, but may not match exactly.
max_likely_optimal <- optim(par=start, fn=ln_likely, gr=nabla_likely, control = list(fnscale = -1 ) )
max_likely_optimal$par
## [1] -3.6304477 0.3829196 1.2420241
Response: The optim() function recovers essentially the same parameters as the glm(family = binomial) model, matching to about three decimal places.